Rank | Abbreviation | Frequency |
---|---|---|
3657 | tzv. | 575 |
4052 | tj. | 513 |
4279 | dr. | 485 |
4608 | prof. | 447 |
4777 | d.o.o. | 433 |
7007 | I. | 284 |
8758 | II. | 225 |
8825 | M. | 223 |
9939 | npr. | 195 |
11387 | D. | 166 |
11546 | A. | 163 |
11559 | d.d. | 163 |
13282 | V. | 138 |
13437 | J. | 136 |
13868 | S. | 131 |
14417 | B. | 125 |
15344 | bh. | 116 |
16685 | N.K. | 104 |
16937 | R. | 102 |
17339 | P. | 99 |
If the word tokenizer recognizes abbreviations ending in a full stop (the only abbreviations that can cause problems during sentence segmentation), they appear in the word list as words ending in a full stop. The list shows the 50 most frequent words of this form.
The list should contain the most frequent abbreviations ending in a full stop. If important abbreviations are missing, the sentence segmenter may mistake their full stops for sentence boundaries.
If the list is generated automatically, it may also contain non-abbreviations. Real sentence boundaries may then be missed wherever such a word ends a sentence.
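The two failure modes described above can be illustrated with a minimal Python sketch. It is not the segmenter used for this corpus: the function `split_sentences`, the small abbreviation set, and the sample sentence are assumptions made up for illustration, and only the full stop is treated as a sentence terminator.

```python
# Hypothetical abbreviation list; the entries are taken from the table above.
ABBREVIATIONS = {"tzv.", "tj.", "dr.", "prof.", "npr."}


def split_sentences(text, abbreviations=ABBREVIATIONS):
    """Naive splitter: end a sentence after every token that ends in a
    full stop, unless the lowercased token is a known abbreviation."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token.endswith(".") and token.lower() not in abbreviations:
            sentences.append(" ".join(current))
            current = []
    if current:  # keep trailing text that has no final full stop
        sentences.append(" ".join(current))
    return sentences


# Invented sample text, not taken from the corpus.
text = "Ovo je dr. Horvat. On radi ovdje."

# With "dr." in the list, the text is split into two sentences.
good = split_sentences(text)

# With "dr." missing, a spurious boundary appears after "dr.".
missing = split_sentences(text, abbreviations=set())

# With the non-abbreviation "horvat." in the list, a real boundary is lost.
polluted = split_sentences(text, abbreviations={"dr.", "horvat."})
```

In this sketch a missing abbreviation produces too many sentences, while a non-abbreviation in the list produces too few, which is why both the coverage and the precision of the list matter.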
The confusion between abbreviations and sentence boundaries arises only in languages that use the same punctuation mark for both. This is not the case for languages such as Chinese or Japanese.
SELECT w_id-100, word, freq FROM words WHERE word LIKE '%.' AND word NOT REGEXP '^[0-9]*[.]$' ORDER BY freq DESC LIMIT 30;
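The selection performed by the query can also be sketched in Python for a plain token list, which may help when no database is at hand. The function name `abbreviation_candidates` and the sample tokens are hypothetical. The sketch keeps tokens that end in a full stop, drops purely numeric ones such as `2.` (and a bare `.`), and returns the most frequent remaining tokens.

```python
import re
from collections import Counter

# Same pattern as in the query: optional digits followed by a full stop.
NUMERIC_ONLY = re.compile(r"[0-9]*\.")


def abbreviation_candidates(tokens, limit=30):
    """Count tokens ending in a full stop, skipping numeric-only ones,
    and return the `limit` most frequent as (token, frequency) pairs."""
    counts = Counter(
        token
        for token in tokens
        if token.endswith(".") and not NUMERIC_ONLY.fullmatch(token)
    )
    return counts.most_common(limit)


# Invented token sample, not corpus data.
sample = ["dr.", "dr.", "2.", ".", "npr.", "grad", "dr."]
candidates = abbreviation_candidates(sample)
```

As with the automatically generated list, the result contains candidates only: any sentence-final word tokenized together with its full stop would appear here as well.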